Improved Accuracy and Parallelism for MRRR-based Eigensolvers -- A Mixed Precision Approach
The real symmetric tridiagonal eigenproblem is of outstanding importance in
numerical computations; it arises frequently as part of eigensolvers for
standard and generalized dense Hermitian eigenproblems that are based on a
reduction to tridiagonal form. For its solution, the algorithm of Multiple
Relatively Robust Representations (MRRR) is among the fastest methods. Although
fast, the solvers based on MRRR do not deliver the same accuracy as competing
methods like Divide & Conquer or the QR algorithm. In this paper, we
demonstrate that the use of mixed precisions leads to improved accuracy of
MRRR-based eigensolvers with limited or no performance penalty. As a result, we
obtain eigensolvers that are not only as accurate as, or more accurate than, the best
available methods, but also, in most circumstances, faster and more scalable
than the competition.
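The mixed-precision principle can be illustrated with a toy sketch (not the MRRR algorithm itself, which is far more involved): run the bulk of the iteration, here a power iteration, in emulated single precision, then spend a single cheap double-precision step to recover extra accuracy. The `f32` helper and the 2×2 matrix are purely illustrative, and the rounding only approximates true float32 arithmetic.

```python
import struct

def f32(x):
    """Round a Python double to the nearest IEEE float32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

def matvec(A, v, rnd=lambda x: x):
    """Matrix-vector product with an optional per-operation rounding mode."""
    return [rnd(sum(rnd(a * x) for a, x in zip(row, v))) for row in A]

A = [[2.0, 1.0], [1.0, 3.0]]   # toy symmetric matrix

# Power iteration carried out (approximately) in emulated single precision.
v = [1.0, 0.0]
for _ in range(50):
    w = matvec(A, v, f32)
    norm = f32(max(abs(x) for x in w))
    v = [f32(x / norm) for x in w]

# One double-precision Rayleigh quotient yields an eigenvalue estimate far
# more accurate than the single-precision eigenvector it is built from.
Av = matvec(A, v)
lam = sum(x * y for x, y in zip(v, Av)) / sum(x * x for x in v)
```

The squared-error behaviour of the Rayleigh quotient is what makes this pay off: an eigenvector accurate to single precision gives an eigenvalue accurate to nearly double precision.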
Evaluating Asymmetric Multicore Systems-on-Chip using Iso-Metrics
The end of Dennard scaling has pushed power consumption into a first order
concern for current systems, on par with performance. As a result,
near-threshold voltage computing (NTVC) has been proposed as a potential means
to tackle the limited cooling capacity of CMOS technology. Hardware operating
in NTV consumes significantly less power, at the cost of lower frequency, and
thus reduced performance, as well as increased error rates. In this paper, we
investigate whether a low-power system-on-chip, built around ARM's asymmetric
big.LITTLE technology, can be an alternative to conventional high-performance
multicore processors in terms of power/energy in an unreliable scenario. For
our study, we use the Conjugate Gradient solver, an algorithm representative of
the computations performed by a large range of scientific and engineering
codes.
Comment: Presented at HiPEAC EEHCO '15, 6 pages.
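For reference, the Conjugate Gradient kernel used as the benchmark above can be sketched in a few lines of scalar Python; the 2×2 system is a made-up example, and a production solver would of course operate on sparse data structures:

```python
def cg(A, b, tol=1e-10, maxit=100):
    """Unpreconditioned Conjugate Gradient for a dense SPD system A x = b."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                     # residual r = b - A x (with x = 0)
    p = r[:]                     # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(maxit):
        Ap = [sum(a * pj for a, pj in zip(row, p)) for row in A]
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol * tol:   # converged: residual small enough
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]     # toy SPD matrix
b = [1.0, 2.0]
x = cg(A, b)                     # exact solution is [1/11, 7/11]
```

In exact arithmetic CG finishes in at most n iterations; here it converges in two.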
Modeling power consumption of 3D MPDATA and the CG method on ARM and Intel multicore architectures
We propose an approach to estimate the power consumption of algorithms, as a function of the frequency and number of cores, using only a very reduced set of real power measurements. In addition, we provide the formulation of a method to select the voltage-frequency scaling and concurrency throttling configurations that should be tested in order to obtain accurate estimations of the power dissipation. The power models and selection methodology are verified using two real scientific applications: the stencil-based 3D MPDATA algorithm and the conjugate gradient (CG) method for sparse linear systems. MPDATA is a crucial component of the EULAG model, which is widely used in weather forecast simulations. The CG algorithm is the keystone for the iterative solution of sparse symmetric positive definite linear systems via Krylov subspace methods. The reliability of the method is confirmed for a variety of ARM and Intel architectures, where the estimated results correspond to the real measured values with an average error slightly below 5% in all cases.
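One common way to build such a model from a handful of measurements is an ordinary least-squares fit. The sketch below assumes the textbook form P = P_static + k · cores · f³ (dynamic power growing with cores and roughly with f³ under voltage-frequency scaling); the paper's actual formulation may differ, and the "measurements" here are synthetic:

```python
# Synthetic "measurements" at a few (frequency, cores) operating points,
# generated from P = P_static + k * cores * f^3 with P_static = 2, k = 0.5.
points = [(1.0, 1), (1.5, 2), (2.0, 4), (1.2, 3)]
samples = [(f, c, 2.0 + 0.5 * c * f**3) for f, c in points]

# The model P = a + b * x is linear in (a, b) with feature x = cores * f^3,
# so the normal equations of least squares are a 2x2 system.
xs = [c * f**3 for f, c, _ in samples]
ps = [p for _, _, p in samples]
n = len(xs)
sx, sxx = sum(xs), sum(x * x for x in xs)
sp, sxp = sum(ps), sum(x * p for x, p in zip(xs, ps))
det = n * sxx - sx * sx
a = (sxx * sp - sx * sxp) / det    # estimated static power
b = (n * sxp - sx * sp) / det      # estimated dynamic coefficient
```

With noise-free synthetic data the fit recovers the generating coefficients exactly; with real measurements the quality of the fit depends on which configurations are sampled, which is precisely what the selection methodology above addresses.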
Using graphics processors to accelerate the computation of the matrix inverse
We study the use of massively parallel architectures for computing a matrix
inverse. Two different algorithms are reviewed, the traditional approach based on
Gaussian elimination and the Gauss-Jordan elimination alternative, and several high
performance implementations are presented and evaluated. The target architecture is a
current general-purpose multi-core processor (CPU) connected to a graphics processor
(GPU). Numerical experiments show the efficiency attained by the proposed implementations
and how the computation of large-scale inverses, which only a few years
ago would have required a distributed-memory cluster, now takes only a few minutes on a
hybrid architecture formed by a multi-core CPU and a GPU.
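A minimal sketch of the Gauss-Jordan alternative (scalar Python with partial pivoting; the GPU versions discussed above are blocked and tuned, but the arithmetic pattern is the same):

```python
def gauss_jordan_inverse(A):
    """Invert a dense matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augment A with the identity: [A | I].
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the top.
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        d = M[col][col]
        M[col] = [x / d for x in M[col]]          # scale pivot row
        for r in range(n):                         # eliminate above AND below
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    # The right half now holds A^{-1}.
    return [row[n:] for row in M]

Ainv = gauss_jordan_inverse([[4.0, 7.0], [2.0, 6.0]])
```

Unlike Gaussian elimination followed by triangular solves, Gauss-Jordan eliminates above and below the pivot in one sweep, a property that maps well onto data-parallel hardware.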
Toward a modular precision ecosystem for high performance computing
With the memory bandwidth of current computer architectures being significantly slower than their (floating-point) arithmetic performance, many scientific computations leverage only a fraction of the computational power of today's high-performance architectures. At the same time, memory operations are the primary energy consumer in modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between floating-point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientific applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a "modular precision ecosystem" that allows for more flexibility in terms of customized data access. For memory-bound scientific applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm, two applications that are popular in their communities and at the same time characteristic representatives of the fields of numerical linear algebra and data analytics, respectively.
This work was supported by the Impuls und Vernetzungsfond of the Helmholtz Association under grant VH-NG-1241. G. Flegar and E. S. Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 OPRECOMP.
Anzt, H.; Flegar, G.; Gruetzmacher, T.; Quintana-Ortí, E. S. (2019). Toward a modular precision ecosystem for high performance computing. International Journal of High Performance Computing Applications, 33(6), 1069-1078.
https://doi.org/10.1177/1094342019846547
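The core storage/arithmetic decoupling can be demonstrated in a few lines. This is only the basic idea, not the article's actual formats (which include techniques such as mantissa segmentation): data is kept in a compact 32-bit storage format, halving memory traffic, while all arithmetic runs in 64-bit.

```python
import array

# Data produced in double precision...
vals = [1.0 / 3.0, 2.0 / 7.0, 3.0 / 11.0]

# ...but stored in a compact 4-byte-per-entry format. For a memory-bound
# kernel this halves the bytes moved per value.
stored = array.array('f', vals)

# The processing format stays double precision: reading an entry promotes
# it back to a 64-bit Python float before any arithmetic.
total = sum(stored)
exact = sum(vals)
err = abs(total - exact)   # only the storage rounding, not the arithmetic
```

The perturbation introduced by the compact storage is on the order of single-precision unit roundoff; for preconditioners or PageRank-style fixed-point iterations, that perturbation is often tolerable, which is what makes the format switch attractive.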
On nested task parallelism in the LU factorization of Hierarchical Matrices
Paper presented at the XXX Jornadas de Paralelismo (JP2019) and the IV Jornadas de Computación Empotrada y Reconfigurable (JCER2019) / Jornadas de la Sociedad de Arquitectura y Tecnología de Computadores (SARTECO), September 18-19, 2019.
This article presents a parallel version of the LU factorization of Hierarchical Matrices (H-matrices) arising from Boundary Element Methods (BEM). These matrices contain internal structures whose dimensions change while operations are executed on them, so the data structures must be decoupled from those used to represent the dependencies among the tasks on which the parallel implementation is based. We use the OmpSs-2 programming model and its runtime to determine the data flow intrinsic to the parallelism at execution time, as well as to exploit weak task dependencies and the early release of dependencies. Thanks to these features, the execution of the parallel H-LU version can be accelerated and its performance improved.
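The dependency structure that a task-based runtime like OmpSs-2 tracks is visible even in a plain sequential LU. The sketch below is scalar, without pivoting, on a hypothetical dense matrix; the H-matrix case adds variable-size blocks, which is exactly why the dependency representation must be decoupled from the data structures.

```python
def lu(A):
    """Doolittle LU factorization without pivoting: A = L U, L unit lower."""
    n = len(A)
    U = [row[:] for row in A]
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):
        # In a task-parallel formulation, each row update below is a task
        # that reads the pivot row k and writes row i; the runtime derives
        # the execution order from exactly these read/write dependencies.
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U

A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
L, U = lu(A)   # L and U should multiply back to A
```

Tasks updating independent rows of the trailing submatrix have no mutual dependencies and can run concurrently; "early release" lets a consumer task start as soon as the specific data it reads is ready, rather than when the whole producer region finishes.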
Characterization of Multicore Architectures using Task-Parallel ILU-type Preconditioned CG Solvers
Paper presented at the 2nd Workshop on Power-Aware Computing (PACO 2017), Ringberg Castle, Germany, July 5-8, 2017.
We investigate the efficiency of state-of-the-art multicore processors using
a multi-threaded task-parallel implementation of the Conjugate Gradient
(CG) method, accelerated with an incomplete LU (ILU) preconditioner.
Concretely, we analyze multicore architectures with distinct
designs and market targets to compare their parallel performance and
energy efficiency.
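Where the preconditioner enters CG can be sketched as follows. For brevity this uses a Jacobi (diagonal) preconditioner as a stand-in, since an ILU factorization would roughly double the code, but the application of M⁻¹ sits in exactly the same place; the 2×2 system is a made-up example.

```python
def pcg(A, b, tol=1e-10, maxit=200):
    """Preconditioned CG for a dense SPD system, with a Jacobi preconditioner
    standing in for the ILU preconditioner (M^{-1} enters at the same spots)."""
    n = len(b)
    Minv = [1.0 / A[i][i] for i in range(n)]          # M = diag(A)
    x = [0.0] * n
    r = b[:]
    z = [mi * ri for mi, ri in zip(Minv, r)]          # apply M^{-1} to residual
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for _ in range(maxit):
        Ap = [sum(a * pj for a, pj in zip(row, p)) for row in A]
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) < tol * tol:
            break
        z = [mi * ri for mi, ri in zip(Minv, r)]      # preconditioner, again
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

x = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

With ILU, the `z = M^{-1} r` steps become sparse triangular solves; their limited parallelism is what makes the task-parallel formulation studied above interesting.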
Solving "Large-Scale" Matrix Problems on Multicore Processors and GPUs
Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance
when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given
the right programming abstractions, coding Out-of-Core (OOC) implementations of dense linear algebra operations (where
data resides on disk and has to be explicitly moved in and out of main memory) is no more difficult than programming
high-performance implementations for the case where the matrix is in memory. Finally, few realize that on a contemporary
eight-core architecture or a platform equipped with a graphics processor (GPU) one can solve a 100,000 × 100,000
symmetric positive definite linear system in about one hour. Thus, for problems that used to be considered large, it is not
necessary to utilize distributed-memory architectures with massive memories if one is willing to wait longer for the solution
to be computed on a fast multithreaded architecture like a multi-core computer or a GPU. This paper provides evidence in
support of these claims.
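The OOC programming pattern, stripped to its essence, reads one panel of the disk-resident matrix into memory at a time. The toy sketch below merely computes a trace; real OOC libraries overlap I/O with compute and use carefully tuned tile sizes.

```python
import os
import struct
import tempfile

n, tile = 8, 2   # an n x n matrix processed 'tile' rows at a time

# Write the matrix to a binary file row by row: this file plays the role
# of the disk-resident, out-of-core data (entry A[i][j] = i*n + j).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as fh:
    for i in range(n):
        fh.write(struct.pack(f'{n}d', *[float(i * n + j) for j in range(n)]))

# Bring only one panel of rows into memory at a time and accumulate the
# diagonal entries; memory use is O(tile * n) instead of O(n * n).
trace = 0.0
with open(path, 'rb') as fh:
    for i0 in range(0, n, tile):
        panel = [struct.unpack(f'{n}d', fh.read(8 * n)) for _ in range(tile)]
        for di, row in enumerate(panel):
            trace += row[i0 + di]
os.remove(path)
```

The key point from the abstract is that, with the right abstractions, the panel loop above looks just like the blocked loop of an in-core algorithm, only with explicit reads at the panel boundary.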
Convolution Operators for Deep Learning Inference on the Fujitsu A64FX Processor
Paper presented at the 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), held in Bordeaux, France.
The convolution operator is a crucial kernel for
many computer vision and signal processing applications that
rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received considerable attention
in the past few years for a fair range of processor architectures.
In this paper, we follow the technology trend toward integrating long SIMD (single instruction, multiple data) arithmetic units
into high performance multicore processors to analyse the benefits of this type of hardware acceleration for latency-constrained
DL workloads. For this purpose, we implement and optimise,
for the Fujitsu A64FX processor, three distinct methods for the
calculation of the convolution, namely, the lowering approach,
a blocked variant of the direct convolution algorithm, and the
Winograd minimal filtering algorithm. Our experimental results
include an extensive evaluation of the parallel scalability of these
three methods and a comparison of their global performance
using three popular DL models and a representative dataset.
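The lowering approach mentioned above is commonly implemented as im2col: each receptive field of the input becomes one row of a matrix, turning convolution into a (highly optimized) matrix multiplication. A scalar sketch for a single-channel 2D case, with made-up shapes and values:

```python
def im2col(img, kh, kw):
    """Lower a 2D input into a matrix: each row holds one kh x kw receptive
    field, so convolution becomes a matrix-vector (or matrix-matrix) product."""
    H, W = len(img), len(img[0])
    rows = []
    for i in range(H - kh + 1):          # one output row per sliding position
        for j in range(W - kw + 1):
            rows.append([img[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
kernel = [1, 0, 0, -1]                   # flattened 2x2 kernel

patches = im2col(img, 2, 2)
# The convolution output is now a plain dot product per patch.
out = [sum(p * k for p, k in zip(row, kernel)) for row in patches]
```

The price of lowering is the duplicated storage for overlapping patches, which is exactly the memory-traffic trade-off that direct and Winograd variants avoid in different ways.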